```
Rows: 16,800,000
Columns: 16
$ job_num       <int> 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 10…
$ method        <fct> no_covs, all_covs, p_hacked, r, partial_r, full_lm, lass…
$ simulation_id <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,…
$ estimate      <dbl> 0.2158, 0.1872, 0.2218, 0.1812, 0.1812, 0.1812, 0.1872, …
$ SE            <dbl> 0.157, 0.151, 0.157, 0.151, 0.151, 0.151, 0.151, 0.156, …
$ p_value       <dbl> 0.171, 0.218, 0.159, 0.234, 0.234, 0.234, 0.217, 0.515, …
$ ndf           <int> 1, 5, 3, 2, 2, 2, 4, 1, 5, 4, 2, 2, 2, 3, 1, 5, 1, 2, 2,…
$ ddf           <int> 148, 144, 146, 147, 147, 147, 145, 148, 144, 145, 147, 1…
$ covs_tpr      <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,…
$ covs_fpr      <dbl> 0.000, 1.000, 0.667, 0.000, 0.000, 0.000, 0.667, 0.000, …
$ n_obs         <int> 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 1…
$ b_x           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_covs        <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ r_ycov        <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0…
$ p_good_covs   <dbl> 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.…
$ r_cov         <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0…
```
# Analysis of Covariate Results

## Background
We ran simulations to generate data, fit linear models using different covariate selection methods, and extracted the results. In each simulation, we generated data with a dichotomous \(X\). We generated the covariates and \(Y\) from a multivariate normal distribution with \(\mu = 0\) and a covariance matrix based on the correlation values listed below.
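As a rough illustration of this setup, the sketch below (not the actual simulation code, which draws all covariates and \(Y\) from a single multivariate normal) generates one dataset with a dichotomous \(X\), one "good" covariate whose correlation with \(Y\) is `r_ycov`, and one unrelated covariate; all names are hypothetical:

```python
import math
import random

def simulate_dataset(n_obs=150, b_x=0.0, r_ycov=0.3, seed=None):
    """Simplified one-good-covariate version of the simulation:
    a dichotomous X, one covariate correlated r_ycov with Y, and
    one covariate unrelated to Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_obs):
        x = rng.choice([0, 1])            # dichotomous group indicator
        good = rng.gauss(0, 1)            # covariate related to Y
        bad = rng.gauss(0, 1)             # covariate unrelated to Y
        # Y carries the X effect plus a component correlated with `good`
        noise = math.sqrt(1 - r_ycov ** 2) * rng.gauss(0, 1)
        y = b_x * x + r_ycov * good + noise
        rows.append({"x": x, "good": good, "bad": bad, "y": y})
    return rows

data = simulate_dataset(n_obs=1000, b_x=0.0, r_ycov=0.3, seed=1)
```

With `b_x = 0`, the construction gives \(\text{corr}(Y, \text{good}) = r_{ycov}\) exactly in the population, which is the property the full covariance-matrix approach generalizes to many covariates.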
We manipulated the following variables:
| Variable | Description | Values |
|---|---|---|
| b_x | The population parameter for \(X\) | 0, 0.3, 0.5 |
| n_obs | Number of observations in a sample | 50, 100, 150, 200, 300, 400 |
| n_covs | Number of total available covariates | 4, 8, 12, 16, 20 |
| p_good_covs | Proportion of “good” covariates* | 0.25, 0.50, 0.75 |
| r_ycov | Correlation between \(Y\) and the “good” covariates | 0.3, 0.5 |
| r_cov | Correlation between the “good” covariates* | 0.3 |
* Note: here we define “good” covariates as ones that have a nonzero relationship with \(Y\).
We fully crossed all levels, yielding 540 unique research settings.
We used the following 7 methods to select covariates to include in a linear model:
- No covariates
- All covariates
- P-hacking
- Bivariate correlation
- Partial correlation (including \(X\) in the model)
- Full lm (including \(X\) in the model)
- LASSO (including \(X\) in the model)
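The bivariate correlation ("r") rule can be sketched as follows. The report does not state the exact inclusion criterion, so the \(|r| > 2/\sqrt{n}\) cutoff used here (a rough two-sided \(\alpha = .05\) significance test) is an assumption for illustration:

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

def select_by_bivariate_r(y, covariates):
    """Keep covariates whose bivariate correlation with Y exceeds
    a rough significance cutoff, |r| > 2 / sqrt(n).  The cutoff is
    an assumption; the report does not give the exact rule."""
    cutoff = 2 / math.sqrt(len(y))
    return [name for name, vals in covariates.items()
            if abs(pearson_r(vals, y)) > cutoff]
```

The partial-correlation and full-lm variants differ only in that they condition on \(X\) (and, for full lm, on the other covariates) before judging each covariate.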
We fit a linear model for each method and, from the model output, extracted the estimate for \(X\), the standard error of this estimate, and the p-value of the \(X\) effect.
We repeated this 20,000 times for each research setting.
We present the results here.
## Data Analysis

### Preview data
The total amount of data simulated equals 75,600,000 observations = 540 unique research settings \(\times\) 20,000 simulations each \(\times\) 7 methods. We will consider the data separated by the \(X\) effect.
Data for \(b_x = 0\), as an example, are shown in the data preview at the top of this report.
## Zero \(X\) Effect
First, we look at the zero \(X\) effect condition to compare the Type I errors across methods and research settings.
### Type I error
We will look at the overall Type I error across methods, then compare the error of each method across the manipulated variables: n_obs, n_covs, p_good_covs, and r_ycov (r_cov was held constant at 0.3).
#### by method
We will first consider the Type I error by selection method. Here we calculate the proportion of significant effects (\(p < 0.05\)), i.e., the Type I error rate, displayed below as a bar plot.
From this bar plot, we see that the p-hacking method for selecting covariates leads to inflated Type I error rates. The no covariates, all covariates, and bivariate correlation (“r”) approaches are all at the expected 0.05 mark (\(\alpha = 0.05\)), while the partial correlation (“partial r”) approach shows slight inflation, with full lm and lasso showing further inflation of Type I error.
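The Type I error computation described above is a grouped proportion; a minimal sketch, assuming records shaped like the simulation output columns (`method`, `p_value`):

```python
from collections import defaultdict

def type_i_error(records, alpha=0.05):
    """Proportion of simulations with p < alpha, grouped by method.
    `records` is a list of dicts with 'method' and 'p_value' keys
    (mirroring the simulation output columns)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["method"]] += 1
        hits[rec["method"]] += rec["p_value"] < alpha
    return {m: hits[m] / totals[m] for m in totals}

demo = [{"method": "no_covs", "p_value": p} for p in (0.01, 0.20, 0.80, 0.04)]
type_i_error(demo)  # {'no_covs': 0.5}
```

Under the null (\(b_x = 0\)), a well-calibrated method should return values near 0.05.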
##### distributions
Here we view the distributions of the Type I error rate by method, beginning by isolating the p-hacked method.
We see that while the average Type I error for the p-hacked method was 0.198, it reached as high as 0.463, further emphasizing the inflation of error.
Removing this method, we can view the distributions of the remaining 6 methods.
We see that the no covariates, all covariates, and r selection methods show roughly normal distributions centered around 0.05, while partial r, full lm, and lasso are slightly right-skewed, with full lm having the greatest skew.
#### by n_obs
We will view the Type I error rates of each method for the different levels of the number of observations in a sample. In the table below, we see the minimum, maximum, and average Type I error for each level of n_obs by each method.
| n_obs | method | typeI_min | typeI_max | typeI_mean |
|---|---|---|---|---|
| 50 | no_covs | 0.047 | 0.053 | 0.050 |
| 50 | all_covs | 0.047 | 0.053 | 0.050 |
| 50 | p_hacked | 0.086 | 0.428 | 0.215 |
| 50 | r | 0.047 | 0.054 | 0.050 |
| 50 | partial_r | 0.048 | 0.072 | 0.057 |
| 50 | full_lm | 0.055 | 0.096 | 0.074 |
| 50 | lasso | 0.051 | 0.090 | 0.070 |
| 100 | no_covs | 0.047 | 0.053 | 0.050 |
| 100 | all_covs | 0.048 | 0.054 | 0.050 |
| 100 | p_hacked | 0.078 | 0.447 | 0.203 |
| 100 | r | 0.048 | 0.053 | 0.050 |
| 100 | partial_r | 0.050 | 0.057 | 0.052 |
| 100 | full_lm | 0.051 | 0.086 | 0.063 |
| 100 | lasso | 0.050 | 0.066 | 0.058 |
| 150 | no_covs | 0.046 | 0.052 | 0.049 |
| 150 | all_covs | 0.047 | 0.053 | 0.050 |
| 150 | p_hacked | 0.075 | 0.459 | 0.198 |
| 150 | r | 0.048 | 0.054 | 0.050 |
| 150 | partial_r | 0.048 | 0.056 | 0.051 |
| 150 | full_lm | 0.048 | 0.083 | 0.058 |
| 150 | lasso | 0.049 | 0.063 | 0.055 |
| 200 | no_covs | 0.048 | 0.052 | 0.050 |
| 200 | all_covs | 0.048 | 0.053 | 0.050 |
| 200 | p_hacked | 0.072 | 0.458 | 0.194 |
| 200 | r | 0.048 | 0.053 | 0.050 |
| 200 | partial_r | 0.048 | 0.056 | 0.051 |
| 200 | full_lm | 0.049 | 0.073 | 0.057 |
| 200 | lasso | 0.049 | 0.060 | 0.054 |
| 300 | no_covs | 0.047 | 0.053 | 0.050 |
| 300 | all_covs | 0.047 | 0.053 | 0.050 |
| 300 | p_hacked | 0.074 | 0.452 | 0.189 |
| 300 | r | 0.048 | 0.054 | 0.050 |
| 300 | partial_r | 0.049 | 0.054 | 0.051 |
| 300 | full_lm | 0.049 | 0.061 | 0.053 |
| 300 | lasso | 0.049 | 0.056 | 0.052 |
| 400 | no_covs | 0.048 | 0.052 | 0.050 |
| 400 | all_covs | 0.047 | 0.054 | 0.050 |
| 400 | p_hacked | 0.070 | 0.463 | 0.186 |
| 400 | r | 0.047 | 0.053 | 0.050 |
| 400 | partial_r | 0.047 | 0.053 | 0.051 |
| 400 | full_lm | 0.048 | 0.058 | 0.053 |
| 400 | lasso | 0.048 | 0.056 | 0.052 |
Looking at the average column, we see that the Type I error is not affected much for the no covariates, all covariates, r, and partial r approaches across different sample sizes. However, for p-hacked, full lm, and lasso, the Type I error does decrease as sample size increases. This can be visualized in the plot below.
Note: For the line plots throughout this report, a solid line will indicate a method that does not involve performing selection of covariates (i.e., using no or all covariates). A dashed line will indicate a method that does involve performing a non-trivial selection of covariates. A dotted black line will indicate the expected value (if applicable). For example, here we see a dotted line at \(\alpha = 0.05\), the expected Type I error rate.
Again, we see the inflation of Type I error for p-hacking. For smaller sample sizes, we see that lasso and full lm perform worse than the other methods, but the methods become comparable as sample size increases.
#### by n_covs
We will view the Type I error rates of each method for the different numbers of available covariates (note that this is not necessarily the number of covariates included in the model). In the table below, we see the minimum, maximum, and average Type I error for each level of n_covs by each method.
| n_covs | method | typeI_min | typeI_max | typeI_mean |
|---|---|---|---|---|
| 4 | no_covs | 0.046 | 0.053 | 0.050 |
| 4 | all_covs | 0.048 | 0.053 | 0.050 |
| 4 | p_hacked | 0.070 | 0.139 | 0.098 |
| 4 | r | 0.048 | 0.054 | 0.050 |
| 4 | partial_r | 0.048 | 0.059 | 0.051 |
| 4 | full_lm | 0.048 | 0.062 | 0.053 |
| 4 | lasso | 0.048 | 0.061 | 0.052 |
| 8 | no_covs | 0.047 | 0.053 | 0.049 |
| 8 | all_covs | 0.047 | 0.054 | 0.050 |
| 8 | p_hacked | 0.096 | 0.236 | 0.150 |
| 8 | r | 0.048 | 0.053 | 0.050 |
| 8 | partial_r | 0.048 | 0.061 | 0.052 |
| 8 | full_lm | 0.049 | 0.084 | 0.056 |
| 8 | lasso | 0.049 | 0.068 | 0.054 |
| 12 | no_covs | 0.047 | 0.052 | 0.050 |
| 12 | all_covs | 0.047 | 0.053 | 0.050 |
| 12 | p_hacked | 0.118 | 0.332 | 0.202 |
| 12 | r | 0.047 | 0.053 | 0.050 |
| 12 | partial_r | 0.047 | 0.063 | 0.052 |
| 12 | full_lm | 0.048 | 0.091 | 0.060 |
| 12 | lasso | 0.048 | 0.073 | 0.057 |
| 16 | no_covs | 0.048 | 0.053 | 0.050 |
| 16 | all_covs | 0.047 | 0.054 | 0.050 |
| 16 | p_hacked | 0.142 | 0.408 | 0.249 |
| 16 | r | 0.048 | 0.054 | 0.050 |
| 16 | partial_r | 0.049 | 0.068 | 0.053 |
| 16 | full_lm | 0.050 | 0.095 | 0.064 |
| 16 | lasso | 0.049 | 0.082 | 0.059 |
| 20 | no_covs | 0.046 | 0.053 | 0.050 |
| 20 | all_covs | 0.047 | 0.053 | 0.050 |
| 20 | p_hacked | 0.164 | 0.463 | 0.290 |
| 20 | r | 0.047 | 0.053 | 0.050 |
| 20 | partial_r | 0.048 | 0.072 | 0.053 |
| 20 | full_lm | 0.050 | 0.096 | 0.066 |
| 20 | lasso | 0.051 | 0.090 | 0.061 |
Looking at the average column, we see that the Type I error is not affected much for the no covariates, all covariates, r, and partial r approaches across different numbers of available covariates. However, for p-hacked, full lm, and lasso, the Type I error increases as the number of covariates increases. This can be visualized in the plot below.
We see the large increase in error for the p-hacked method and the slight increase for full lm and lasso methods as the number of covariates increases, while the other methods stay around 0.05.
#### by p_good_covs
We will view the Type I error rates of each method for the different proportions of good covariates. In the table below, we see the minimum, maximum, and average Type I error for each level of p_good_covs by each method.
| p_good_covs | method | typeI_min | typeI_max | typeI_mean |
|---|---|---|---|---|
| 0.25 | no_covs | 0.047 | 0.053 | 0.050 |
| 0.25 | all_covs | 0.047 | 0.054 | 0.050 |
| 0.25 | p_hacked | 0.070 | 0.343 | 0.163 |
| 0.25 | r | 0.047 | 0.054 | 0.050 |
| 0.25 | partial_r | 0.049 | 0.072 | 0.053 |
| 0.25 | full_lm | 0.049 | 0.090 | 0.057 |
| 0.25 | lasso | 0.050 | 0.090 | 0.058 |
| 0.50 | no_covs | 0.046 | 0.053 | 0.050 |
| 0.50 | all_covs | 0.047 | 0.054 | 0.050 |
| 0.50 | p_hacked | 0.082 | 0.400 | 0.202 |
| 0.50 | r | 0.048 | 0.053 | 0.050 |
| 0.50 | partial_r | 0.048 | 0.064 | 0.052 |
| 0.50 | full_lm | 0.048 | 0.095 | 0.060 |
| 0.50 | lasso | 0.049 | 0.085 | 0.057 |
| 0.75 | no_covs | 0.047 | 0.053 | 0.050 |
| 0.75 | all_covs | 0.047 | 0.053 | 0.050 |
| 0.75 | p_hacked | 0.091 | 0.463 | 0.228 |
| 0.75 | r | 0.047 | 0.054 | 0.050 |
| 0.75 | partial_r | 0.047 | 0.059 | 0.051 |
| 0.75 | full_lm | 0.048 | 0.096 | 0.062 |
| 0.75 | lasso | 0.048 | 0.080 | 0.055 |
Looking at the average column, we see that the Type I error is not affected for the no covariates, all covariates, and r approaches across different proportions. However, for p-hacked, partial r, full lm, and lasso, the Type I error changes as the proportion of good covariates increases. This can be visualized in the plot below.
In this plot, we mainly see the increase in Type I error for the p-hacking approach as the proportion of good covariates increases. As there are more good covariates, the Type I error decreases for lasso and partial r, but it increases for full lm.
#### by correlations
Recall that in these simulations, the correlation among the good covariates was held constant at 0.3. So, we will look at the Type I error rates of each method by the correlation between \(Y\) and the good covariates.
| r_ycov | method | typeI_min | typeI_max | typeI_mean |
|---|---|---|---|---|
| 0.3 | no_covs | 0.047 | 0.053 | 0.050 |
| 0.3 | all_covs | 0.047 | 0.054 | 0.050 |
| 0.3 | p_hacked | 0.070 | 0.273 | 0.144 |
| 0.3 | r | 0.047 | 0.054 | 0.050 |
| 0.3 | partial_r | 0.049 | 0.072 | 0.053 |
| 0.3 | full_lm | 0.050 | 0.077 | 0.059 |
| 0.3 | lasso | 0.049 | 0.086 | 0.058 |
| 0.5 | no_covs | 0.046 | 0.053 | 0.050 |
| 0.5 | all_covs | 0.047 | 0.054 | 0.050 |
| 0.5 | p_hacked | 0.081 | 0.463 | 0.251 |
| 0.5 | r | 0.047 | 0.054 | 0.050 |
| 0.5 | partial_r | 0.047 | 0.061 | 0.051 |
| 0.5 | full_lm | 0.048 | 0.096 | 0.060 |
| 0.5 | lasso | 0.048 | 0.090 | 0.056 |
Looking at the average column, we see that the error does not change across correlations for the no covariates, all covariates, r, and partial r approaches. It changes slightly for full lm and lasso, and drastically for p-hacking, such that a higher correlation between \(Y\) and the good covariates increases the Type I error.
In the plot, we see the Type I error for the p-hacking method increase as the correlation between \(Y\) and the good covariates increases. The other methods remain at relatively the same Type I error rate across correlation values.
### Estimate, SD, & SE
Here we will compare, across methods, the estimate of \(b_x\), the standard deviation of the estimate, and the average standard error of the estimate. The standard deviation is calculated as the SD of the sampling distribution of the estimates. The standard error is from the linear model output. Since the mean of standard errors would be biased, we calculate the average SE by taking the square root of the mean of the squared standard errors. We then compute the difference by subtracting this average linear model SE from the calculated SD.
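The averaging rule described above (root mean square of the SEs) can be written directly; a minimal sketch:

```python
import math
import statistics

def average_se(ses):
    """Root-mean-square of the per-model standard errors: the plain
    mean of SEs is biased, so average on the variance scale and
    take the square root."""
    return math.sqrt(sum(se ** 2 for se in ses) / len(ses))

def sd_minus_se(estimates, ses):
    """Difference between the empirical SD of the sampling
    distribution of the estimates and the average model SE.
    Values near 0 mean the model SE is well calibrated."""
    return statistics.stdev(estimates) - average_se(ses)

average_se([3.0, 4.0])  # ≈ 3.536, i.e. sqrt((9 + 16) / 2)
```

A positive difference, as seen for p-hacking below, means the model-reported SE understates the true sampling variability.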
| method | mean_estimate | SD_estimate | SE_mean | difference |
|---|---|---|---|---|
| no_covs | 0.000 | 0.178 | 0.178 | 0.000 |
| all_covs | 0.000 | 0.155 | 0.155 | 0.000 |
| p_hacked | 0.000 | 0.238 | 0.157 | 0.080 |
| r | 0.000 | 0.145 | 0.145 | 0.000 |
| partial_r | 0.000 | 0.147 | 0.145 | 0.003 |
| full_lm | 0.000 | 0.156 | 0.147 | 0.008 |
| lasso | 0.000 | 0.150 | 0.143 | 0.007 |
We see that all methods have an average estimate of 0, as expected given the population value for \(b_x\). We see that the standard deviation of the sampling distribution of the estimate equals the standard error for the no covariates, all covariates, and r selection methods. There are small differences between these values for the partial r, full lm, and lasso methods. The p-hacking method shows a large difference.
### Sampling Distributions
Here we view a sampling distribution of the estimate for \(b_x\) for each method.
We see again that each method’s distribution is centered around 0, but the p-hacked method’s is not normally distributed, as it biases the parameter estimates. In addition, we see that the no covariates method has more variability in its estimate for \(b_x\) than the other valid selection methods.
From these primary analyses, we see that the p-hacked method leads to inflated Type I error rates and biased parameter estimates. We can conclude that it is not a statistically valid method. For the following analyses, we will not include the p-hacked method. While partial r, full lm, and lasso selection methods showed slight inflation of Type I error, we might be willing to accept this for greater reductions in Type II error, which we will compare in the next section.
## Nonzero \(X\) Effect
Next, we look at the nonzero \(X\) effect conditions to compare the Type II errors across methods and research settings. Recall that we set two nonzero values for \(b_x\): 0.3 and 0.5.
### Type II Error
We will look at the overall Type II error across methods (excluding p-hacked), then again compare the error of each method across the manipulated variables: n_obs, n_covs, p_good_covs, and r_ycov (r_cov was held constant at 0.3).
#### by method
We will first consider the Type II error by selection method, for both \(b_x = 0.3\) and \(b_x = 0.5\). Here we calculate the proportion of non-significant effects (\(p \geq 0.05\)), i.e., the Type II error rate, displayed below as bar plots.
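Under this definition, the Type II error for one method and setting is simply the share of simulated studies that failed to reject a truly nonzero effect; a minimal sketch:

```python
def type_ii_error(p_values, alpha=0.05):
    """Type II error: share of simulated studies that failed to
    reject when the true effect is nonzero (p >= alpha)."""
    misses = sum(p >= alpha for p in p_values)
    return misses / len(p_values)

def power(p_values, alpha=0.05):
    """Statistical power is the complement of the Type II error."""
    return 1 - type_ii_error(p_values, alpha)

type_ii_error([0.001, 0.20, 0.03, 0.60])  # 0.5
```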
##### b_x = 0.3
In this first plot for \(b_x = 0.3\), we see that the Type II error is highest when no covariates are used in the model. There is a large reduction in Type II error when we include all covariates compared to no covariates, and a slight further reduction in Type II error when we use a selection method for covariates compared to including all.
###### distributions
Here we view the distributions of the Type II error rate by method, for \(b_x = 0.3\).
Aside from the no covariates approach, we see that the other methods have a similar range of Type II error.
##### b_x = 0.5
We see a similar trend as above, of a large reduction in Type II error when including all covariates compared to no covariates, and another reduction when selecting covariates, especially when using r, partial r, or lasso.
###### distributions
Here we view the distributions of the Type II error rate by method, for \(b_x = 0.5\).
We again see a similar range and distribution of Type II error across the selection methods, apart from the no covariates method.
#### by n_obs
We will view the Type II error rates of each method for the different levels of the number of observations in a sample. In the tables below, we see the minimum, maximum, and average Type II error for each level of n_obs by each method, for both \(b_x = 0.3\) and \(b_x = 0.5\).
##### b_x = 0.3
| n_obs | method | typeII_min | typeII_max | typeII_mean |
|---|---|---|---|---|
| 50 | no_covs | 0.814 | 0.826 | 0.820 |
| 50 | all_covs | 0.661 | 0.862 | 0.778 |
| 50 | r | 0.627 | 0.818 | 0.751 |
| 50 | partial_r | 0.622 | 0.803 | 0.732 |
| 50 | full_lm | 0.646 | 0.804 | 0.736 |
| 50 | lasso | 0.555 | 0.802 | 0.713 |
| 100 | no_covs | 0.676 | 0.687 | 0.682 |
| 100 | all_covs | 0.297 | 0.687 | 0.549 |
| 100 | r | 0.270 | 0.668 | 0.528 |
| 100 | partial_r | 0.267 | 0.661 | 0.521 |
| 100 | full_lm | 0.335 | 0.660 | 0.528 |
| 100 | lasso | 0.257 | 0.659 | 0.513 |
| 150 | no_covs | 0.547 | 0.562 | 0.554 |
| 150 | all_covs | 0.107 | 0.527 | 0.378 |
| 150 | r | 0.097 | 0.521 | 0.364 |
| 150 | partial_r | 0.097 | 0.516 | 0.359 |
| 150 | full_lm | 0.139 | 0.516 | 0.366 |
| 150 | lasso | 0.096 | 0.517 | 0.356 |
| 200 | no_covs | 0.432 | 0.447 | 0.440 |
| 200 | all_covs | 0.033 | 0.405 | 0.258 |
| 200 | r | 0.029 | 0.401 | 0.248 |
| 200 | partial_r | 0.029 | 0.399 | 0.246 |
| 200 | full_lm | 0.045 | 0.399 | 0.251 |
| 200 | lasso | 0.030 | 0.399 | 0.245 |
| 300 | no_covs | 0.258 | 0.269 | 0.264 |
| 300 | all_covs | 0.003 | 0.229 | 0.118 |
| 300 | r | 0.003 | 0.228 | 0.114 |
| 300 | partial_r | 0.003 | 0.226 | 0.113 |
| 300 | full_lm | 0.004 | 0.226 | 0.116 |
| 300 | lasso | 0.003 | 0.227 | 0.113 |
| 400 | no_covs | 0.146 | 0.156 | 0.151 |
| 400 | all_covs | 0.000 | 0.121 | 0.054 |
| 400 | r | 0.000 | 0.119 | 0.052 |
| 400 | partial_r | 0.000 | 0.119 | 0.051 |
| 400 | full_lm | 0.000 | 0.119 | 0.053 |
| 400 | lasso | 0.000 | 0.118 | 0.051 |
In this table, we see that for all methods, the Type II error decreases as the sample size increases. This can be seen more clearly in the plot below.
Here, we see the decrease in Type II error for the increase in sample size across all methods. We see that the no covariates approach has the highest Type II error. From both the table and the plot, we see that for smaller sample sizes, including all covariates in the model yields higher Type II errors, but these become comparable for larger sample sizes.
##### b_x = 0.5
| n_obs | method | typeII_min | typeII_max | typeII_mean |
|---|---|---|---|---|
| 50 | no_covs | 0.585 | 0.598 | 0.590 |
| 50 | all_covs | 0.275 | 0.680 | 0.502 |
| 50 | r | 0.226 | 0.579 | 0.446 |
| 50 | partial_r | 0.223 | 0.555 | 0.420 |
| 50 | full_lm | 0.319 | 0.557 | 0.454 |
| 50 | lasso | 0.173 | 0.556 | 0.408 |
| 100 | no_covs | 0.298 | 0.310 | 0.304 |
| 100 | all_covs | 0.015 | 0.305 | 0.171 |
| 100 | r | 0.011 | 0.281 | 0.152 |
| 100 | partial_r | 0.011 | 0.273 | 0.147 |
| 100 | full_lm | 0.035 | 0.273 | 0.159 |
| 100 | lasso | 0.010 | 0.274 | 0.145 |
| 150 | no_covs | 0.134 | 0.146 | 0.141 |
| 150 | all_covs | 0.001 | 0.120 | 0.057 |
| 150 | r | 0.000 | 0.117 | 0.051 |
| 150 | partial_r | 0.000 | 0.114 | 0.049 |
| 150 | full_lm | 0.002 | 0.114 | 0.055 |
| 150 | lasso | 0.000 | 0.116 | 0.049 |
| 200 | no_covs | 0.056 | 0.063 | 0.059 |
| 200 | all_covs | 0.000 | 0.044 | 0.018 |
| 200 | r | 0.000 | 0.042 | 0.016 |
| 200 | partial_r | 0.000 | 0.041 | 0.016 |
| 200 | full_lm | 0.000 | 0.041 | 0.018 |
| 200 | lasso | 0.000 | 0.042 | 0.016 |
| 300 | no_covs | 0.008 | 0.011 | 0.009 |
| 300 | all_covs | 0.000 | 0.006 | 0.002 |
| 300 | r | 0.000 | 0.005 | 0.001 |
| 300 | partial_r | 0.000 | 0.005 | 0.001 |
| 300 | full_lm | 0.000 | 0.005 | 0.002 |
| 300 | lasso | 0.000 | 0.006 | 0.001 |
| 400 | no_covs | 0.001 | 0.002 | 0.001 |
| 400 | all_covs | 0.000 | 0.001 | 0.000 |
| 400 | r | 0.000 | 0.000 | 0.000 |
| 400 | partial_r | 0.000 | 0.000 | 0.000 |
| 400 | full_lm | 0.000 | 0.000 | 0.000 |
| 400 | lasso | 0.000 | 0.001 | 0.000 |
We see lower overall Type II errors, but the same trend of decreasing errors for increasing sample sizes across all methods. We can view this in the plot below.
Similarly, the no covariates approach has the highest Type II error. For small sample sizes, the all covariates approach has higher error, which becomes comparable to the selection methods as sample size increases.
#### by n_covs
We will view the Type II error rates of each method for the different number of available covariates. In the tables below, we see the minimum, maximum, and average Type II error for each level of n_covs by each method, for both \(b_x = 0.3\) and \(b_x = 0.5\).
##### b_x = 0.3
| n_covs | method | typeII_min | typeII_max | typeII_mean |
|---|---|---|---|---|
| 4 | no_covs | 0.148 | 0.822 | 0.485 |
| 4 | all_covs | 0.016 | 0.821 | 0.396 |
| 4 | r | 0.015 | 0.818 | 0.393 |
| 4 | partial_r | 0.015 | 0.803 | 0.390 |
| 4 | full_lm | 0.016 | 0.804 | 0.390 |
| 4 | lasso | 0.016 | 0.802 | 0.390 |
| 8 | no_covs | 0.149 | 0.822 | 0.485 |
| 8 | all_covs | 0.004 | 0.826 | 0.362 |
| 8 | r | 0.004 | 0.815 | 0.355 |
| 8 | partial_r | 0.004 | 0.790 | 0.350 |
| 8 | full_lm | 0.004 | 0.789 | 0.350 |
| 8 | lasso | 0.004 | 0.787 | 0.348 |
| 12 | no_covs | 0.151 | 0.822 | 0.485 |
| 12 | all_covs | 0.001 | 0.834 | 0.345 |
| 12 | r | 0.001 | 0.815 | 0.332 |
| 12 | partial_r | 0.001 | 0.785 | 0.326 |
| 12 | full_lm | 0.001 | 0.785 | 0.330 |
| 12 | lasso | 0.001 | 0.774 | 0.322 |
| 16 | no_covs | 0.146 | 0.822 | 0.485 |
| 16 | all_covs | 0.000 | 0.841 | 0.339 |
| 16 | r | 0.000 | 0.813 | 0.320 |
| 16 | partial_r | 0.000 | 0.783 | 0.314 |
| 16 | full_lm | 0.000 | 0.780 | 0.321 |
| 16 | lasso | 0.000 | 0.761 | 0.306 |
| 20 | no_covs | 0.147 | 0.826 | 0.485 |
| 20 | all_covs | 0.000 | 0.862 | 0.337 |
| 20 | r | 0.000 | 0.816 | 0.313 |
| 20 | partial_r | 0.000 | 0.792 | 0.306 |
| 20 | full_lm | 0.000 | 0.784 | 0.318 |
| 20 | lasso | 0.000 | 0.758 | 0.293 |
From the average column, we can see that Type II error decreases as the number of covariates increases for all methods except the no covariates method, which does not depend on the number of covariates. We can see these trends in the plot below.
In this plot, as the number of covariates increases, we see larger reductions in Type II error when selecting the covariates compared to including all available covariates. Lasso performs the best with higher numbers of covariates, followed by partial r, r, and full lm in order. Again, using no covariates yields the highest Type II error.
##### b_x = 0.5
| n_covs | method | typeII_min | typeII_max | typeII_mean |
|---|---|---|---|---|
| 4 | no_covs | 0.001 | 0.598 | 0.184 |
| 4 | all_covs | 0.000 | 0.586 | 0.134 |
| 4 | r | 0.000 | 0.578 | 0.132 |
| 4 | partial_r | 0.000 | 0.555 | 0.128 |
| 4 | full_lm | 0.000 | 0.557 | 0.130 |
| 4 | lasso | 0.000 | 0.556 | 0.129 |
| 8 | no_covs | 0.001 | 0.592 | 0.184 |
| 8 | all_covs | 0.000 | 0.601 | 0.121 |
| 8 | r | 0.000 | 0.575 | 0.114 |
| 8 | partial_r | 0.000 | 0.536 | 0.109 |
| 8 | full_lm | 0.000 | 0.541 | 0.114 |
| 8 | lasso | 0.000 | 0.535 | 0.109 |
| 12 | no_covs | 0.001 | 0.595 | 0.185 |
| 12 | all_covs | 0.000 | 0.626 | 0.120 |
| 12 | r | 0.000 | 0.570 | 0.106 |
| 12 | partial_r | 0.000 | 0.522 | 0.101 |
| 12 | full_lm | 0.000 | 0.544 | 0.110 |
| 12 | lasso | 0.000 | 0.519 | 0.099 |
| 16 | no_covs | 0.001 | 0.594 | 0.184 |
| 16 | all_covs | 0.000 | 0.657 | 0.122 |
| 16 | r | 0.000 | 0.579 | 0.102 |
| 16 | partial_r | 0.000 | 0.523 | 0.096 |
| 16 | full_lm | 0.000 | 0.543 | 0.109 |
| 16 | lasso | 0.000 | 0.515 | 0.092 |
| 20 | no_covs | 0.001 | 0.594 | 0.183 |
| 20 | all_covs | 0.000 | 0.680 | 0.128 |
| 20 | r | 0.000 | 0.570 | 0.100 |
| 20 | partial_r | 0.000 | 0.530 | 0.094 |
| 20 | full_lm | 0.000 | 0.542 | 0.111 |
| 20 | lasso | 0.000 | 0.494 | 0.087 |
We see the same decrease in Type II errors for increases in number of covariates. We can visualize this in the plot below.
Similarly, we see the selection methods performing better than using all covariates. For higher numbers of covariates, we see lasso has the lowest Type II error, followed by partial r, r, and full lm. Using no covariates yields the highest Type II error.
#### by p_good_covs
We will view the Type II error rates of each method for the different proportions of “good” covariates. In the table below, we see the minimum, maximum, and average Type II error for each level of p_good_covs by each method, for both \(b_x = 0.3\) and \(b_x = 0.5\).
##### b_x = 0.3
| p_good_covs | method | typeII_min | typeII_max | typeII_mean |
|---|---|---|---|---|
| 0.25 | no_covs | 0.147 | 0.826 | 0.485 |
| 0.25 | all_covs | 0.007 | 0.862 | 0.393 |
| 0.25 | r | 0.006 | 0.818 | 0.375 |
| 0.25 | partial_r | 0.006 | 0.803 | 0.367 |
| 0.25 | full_lm | 0.005 | 0.804 | 0.367 |
| 0.25 | lasso | 0.006 | 0.802 | 0.363 |
| 0.50 | no_covs | 0.148 | 0.822 | 0.486 |
| 0.50 | all_covs | 0.001 | 0.856 | 0.349 |
| 0.50 | r | 0.001 | 0.816 | 0.336 |
| 0.50 | partial_r | 0.001 | 0.797 | 0.330 |
| 0.50 | full_lm | 0.001 | 0.795 | 0.335 |
| 0.50 | lasso | 0.001 | 0.795 | 0.326 |
| 0.75 | no_covs | 0.146 | 0.822 | 0.485 |
| 0.75 | all_covs | 0.000 | 0.853 | 0.326 |
| 0.75 | r | 0.000 | 0.809 | 0.317 |
| 0.75 | partial_r | 0.000 | 0.792 | 0.314 |
| 0.75 | full_lm | 0.000 | 0.789 | 0.322 |
| 0.75 | lasso | 0.000 | 0.788 | 0.307 |
From the average column, we see decreases in Type II error rates for increases in the proportion of good covariates across all methods except no covariates, which is independent of the proportion of good covariates. We can visualize the trends more clearly in the plot below.
In addition to the decrease in error mentioned above, we see that including all covariates has a higher Type II error than selecting covariates, especially for lower proportions of good covariates. For larger proportions of good covariates, we see that lasso has the lowest Type II error and that the all covariates and full lm approaches become comparable.
##### b_x = 0.5
| p_good_covs | method | typeII_min | typeII_max | typeII_mean |
|---|---|---|---|---|
| 0.25 | no_covs | 0.001 | 0.594 | 0.184 |
| 0.25 | all_covs | 0.000 | 0.680 | 0.144 |
| 0.25 | r | 0.000 | 0.579 | 0.125 |
| 0.25 | partial_r | 0.000 | 0.555 | 0.117 |
| 0.25 | full_lm | 0.000 | 0.557 | 0.123 |
| 0.25 | lasso | 0.000 | 0.556 | 0.117 |
| 0.50 | no_covs | 0.001 | 0.595 | 0.184 |
| 0.50 | all_covs | 0.000 | 0.675 | 0.121 |
| 0.50 | r | 0.000 | 0.572 | 0.108 |
| 0.50 | partial_r | 0.000 | 0.544 | 0.102 |
| 0.50 | full_lm | 0.000 | 0.549 | 0.112 |
| 0.50 | lasso | 0.000 | 0.545 | 0.100 |
| 0.75 | no_covs | 0.001 | 0.598 | 0.184 |
| 0.75 | all_covs | 0.000 | 0.663 | 0.110 |
| 0.75 | r | 0.000 | 0.572 | 0.100 |
| 0.75 | partial_r | 0.000 | 0.544 | 0.097 |
| 0.75 | full_lm | 0.000 | 0.550 | 0.109 |
| 0.75 | lasso | 0.000 | 0.545 | 0.092 |
From the table, we see decreases in Type II error for higher proportions of good covariates for methods that do include covariates. We can see the details in the plot below.
We again see that including all covariates has a higher Type II error rate than selecting covariates to include, although this method improves for higher proportions of good covariates. Partial r performs best for a lower proportion of good covariates while lasso performs best for a higher proportion. Similarly, full lm and all covariates approaches become comparable as the proportion increases.
#### by correlations
Recall that in these simulations, the correlation among the good covariates was held constant at 0.3. So, we will look at the Type II error rates of each method by the correlation between \(Y\) and the good covariates, for both \(b_x = 0.3\) and \(b_x = 0.5\).
##### b_x = 0.3
| r_ycov | method | typeII_min | typeII_max | typeII_mean |
|---|---|---|---|---|
| 0.3 | no_covs | 0.149 | 0.826 | 0.485 |
| 0.3 | all_covs | 0.076 | 0.862 | 0.442 |
| 0.3 | r | 0.073 | 0.818 | 0.430 |
| 0.3 | partial_r | 0.073 | 0.803 | 0.422 |
| 0.3 | full_lm | 0.084 | 0.804 | 0.425 |
| 0.3 | lasso | 0.072 | 0.802 | 0.415 |
| 0.5 | no_covs | 0.146 | 0.822 | 0.485 |
| 0.5 | all_covs | 0.000 | 0.788 | 0.270 |
| 0.5 | r | 0.000 | 0.779 | 0.256 |
| 0.5 | partial_r | 0.000 | 0.771 | 0.252 |
| 0.5 | full_lm | 0.000 | 0.771 | 0.258 |
| 0.5 | lasso | 0.000 | 0.767 | 0.249 |
In the table, we see that the Type II error decreases as the correlation between \(Y\) and the good covariates increases. The Type II error is highest for including no covariates. We can visualize this below.
In the plots, we see that the no covariates method has the highest Type II error across correlation levels. The Type II errors decrease as the correlation between \(Y\) and the good covariates increases. We also see slight decreases in Type II error from including all covariates to selecting them.
##### b_x = 0.5
| r_ycov | method | typeII_min | typeII_max | typeII_mean |
|---|---|---|---|---|
| 0.3 | no_covs | 0.001 | 0.598 | 0.184 |
| 0.3 | all_covs | 0.000 | 0.680 | 0.170 |
| 0.3 | r | 0.000 | 0.579 | 0.154 |
| 0.3 | partial_r | 0.000 | 0.555 | 0.146 |
| 0.3 | full_lm | 0.000 | 0.557 | 0.153 |
| 0.3 | lasso | 0.000 | 0.556 | 0.142 |
| 0.5 | no_covs | 0.001 | 0.595 | 0.184 |
| 0.5 | all_covs | 0.000 | 0.517 | 0.080 |
| 0.5 | r | 0.000 | 0.499 | 0.068 |
| 0.5 | partial_r | 0.000 | 0.488 | 0.065 |
| 0.5 | full_lm | 0.000 | 0.486 | 0.076 |
| 0.5 | lasso | 0.000 | 0.487 | 0.064 |
From the table, we see that the Type II error decreases as the correlation between \(Y\) and the good covariates increases.
In the plot, we see again that the no covariates approach has the highest Type II error across correlations. There is a slight decrease in Type II error when selecting covariates instead of using all covariates. Among the selection methods, full lm has the highest Type II error for a lower \(Y\)-covariate correlation, but the errors become comparable for a higher correlation.
### Estimate, SD, & SE
Here we will compare, across methods, the estimate of \(b_x\), the standard deviation of the estimate, and the average standard error of the estimate. The standard deviation is calculated as the SD of the sampling distribution of the estimates. The standard error is from the linear model output. Since the mean of standard errors would be biased, we calculate the average SE by taking the square root of the mean of the squared standard errors. We then compute the difference by subtracting this average linear model SE from the calculated SD.
#### b_x = 0.3
| method | mean_estimate | SD_estimate | mean_SE | difference |
|---|---|---|---|---|
| no_covs | 0.300 | 0.178 | 0.178 | 0.000 |
| all_covs | 0.300 | 0.155 | 0.155 | 0.000 |
| r | 0.297 | 0.145 | 0.145 | 0.000 |
| partial_r | 0.300 | 0.147 | 0.145 | 0.003 |
| full_lm | 0.300 | 0.156 | 0.147 | 0.008 |
| lasso | 0.300 | 0.150 | 0.143 | 0.007 |
Here we see that all methods correctly estimate \(b_x\) to be 0.3, except the r approach, which yielded a slightly lower average estimate (0.297). The no covariates, all covariates, and r approaches show no difference between the calculated SD and the linear model SE, while the partial r, full lm, and lasso approaches show slight positive differences.
b_x = 0.5
| method | mean_estimate | SD_estimate | mean_SE | difference |
|---|---|---|---|---|
| no_covs | 0.500 | 0.178 | 0.178 | 0.000 |
| all_covs | 0.500 | 0.155 | 0.155 | 0.000 |
| r | 0.495 | 0.145 | 0.145 | 0.000 |
| partial_r | 0.500 | 0.147 | 0.145 | 0.003 |
| full_lm | 0.500 | 0.155 | 0.147 | 0.008 |
| lasso | 0.500 | 0.149 | 0.143 | 0.007 |
Similarly, we see that all methods correctly estimate \(b_x\) to be 0.5, except the r approach, which again yielded a slightly lower average estimate (0.495). The no covariates, all covariates, and r approaches show no difference between the calculated SD and the linear model SE, while the partial r, full lm, and lasso approaches again show slight positive differences.
Sampling Distributions
Here we view sampling distributions of the estimate for \(b_x\) for each method, for both \(b_x = 0.3\) and \(b_x = 0.5\).
b_x = 0.3
In the plot, we can see that the distributions for all methods are centered around 0.3. The no covariates approach again has the widest distribution.
b_x = 0.5
In the plot, we can see that the distributions for all methods are centered around 0.5. The no covariates approach again has the widest distribution.
From these secondary analyses, we see that using no covariates leads to the highest Type II error rates across research settings. Using all covariates provides a reduction in Type II error (and thus more power), but using a non-trivial selection method leads to further reductions.
Good covariates true positive & false positive rates
Finally, we will look at the true positive rate (TPR) and false positive rate (FPR) for including good covariates, to compare how well each method recovered the covariates that were part of the data generating process. Ideally, we want a high TPR, indicating that the method included a high proportion of the good covariates, and a low FPR, indicating that it included a low proportion of the bad covariates. These rates are displayed in the tables below; we include the p-hacked method again.
It also makes sense to consider the TPR and FPR for including good covariates alongside the actual number of good covariates. We will look at plots of these below as well, for \(b_x = 0, 0.3, 0.5\).
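As a sketch of how the TPR and FPR are computed for a single fitted model (the covariate names here are hypothetical):

```python
# Hypothetical example: which covariates a method selected, versus which
# covariates were truly part of the data generating process ("good")
good = {"c1", "c2"}       # good covariates
bad = {"c3", "c4"}        # bad covariates
selected = {"c1", "c3"}   # covariates the method included

# TPR: proportion of good covariates that were included
tpr = len(selected & good) / len(good)

# FPR: proportion of bad covariates that were included
fpr = len(selected & bad) / len(bad)
```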
True Positive Rate (TPR)
b_x = 0
| method | covs_tpr_mean |
|---|---|
| no_covs | 0.000 |
| all_covs | 1.000 |
| p_hacked | 0.496 |
| r | 0.947 |
| partial_r | 0.945 |
| full_lm | 0.641 |
| lasso | 0.888 |
On average, we see that r, partial r, and lasso perform the best, while full lm has a considerably lower TPR.
As the number of good covariates increases, we see that full lm has a large decrease in TPR and lasso has a smaller decrease. The other methods remain relatively stable. The p-hacked method has a relatively flat rate of about 50%, indicating that it includes only about half of the good covariates.
b_x = 0.3
| method | covs_tpr_mean |
|---|---|
| no_covs | 0.000 |
| all_covs | 1.000 |
| p_hacked | 0.656 |
| r | 0.944 |
| partial_r | 0.945 |
| full_lm | 0.641 |
| lasso | 0.888 |
Similarly, r, partial r, and lasso methods have a higher TPR than full lm.
We see a similar trend of large decreases for full lm and smaller decreases for lasso in TPR as the number of good covariates increases.
b_x = 0.5
| method | covs_tpr_mean |
|---|---|
| no_covs | 0.000 |
| all_covs | 1.000 |
| p_hacked | 0.747 |
| r | 0.940 |
| partial_r | 0.945 |
| full_lm | 0.641 |
| lasso | 0.888 |
Again we see that r, partial r, and lasso perform the best. In this case with a larger \(X\) effect, we see partial r starting to outperform the r method in terms of TPR.
We see the same decreases in TPR for full lm and lasso as the number of good covariates increases.
False Positive Rate (FPR)
b_x = 0
| method | covs_fpr_mean |
|---|---|
| no_covs | 0.000 |
| all_covs | 1.000 |
| p_hacked | 0.466 |
| r | 0.050 |
| partial_r | 0.050 |
| full_lm | 0.050 |
| lasso | 0.313 |
For the FPR, we see that r, partial r, and full lm are all comparable, while lasso performs considerably worse.
We see that as the number of good covariates increases, most methods remain stable in their FPR, except for lasso.
b_x = 0.3
| method | covs_fpr_mean |
|---|---|
| no_covs | 0.000 |
| all_covs | 1.000 |
| p_hacked | 0.415 |
| r | 0.050 |
| partial_r | 0.050 |
| full_lm | 0.050 |
| lasso | 0.314 |
Similarly, r, partial r, and full lm are all comparable, while lasso performs worse than these methods.
The valid selection methods are again comparable, except for lasso, whose FPR changes as the number of good covariates increases.
b_x = 0.5
| method | covs_fpr_mean |
|---|---|
| no_covs | 0.000 |
| all_covs | 1.000 |
| p_hacked | 0.375 |
| r | 0.050 |
| partial_r | 0.050 |
| full_lm | 0.050 |
| lasso | 0.314 |
Again, we see that r, partial r, and full lm are all comparable, while lasso performs worse and approaches the FPR of p-hacking.
The lasso method shows the same trend in FPR here, while the other methods remain stable.
Conclusions
We compared seven methods for selecting covariates to include in linear models. In the first section, on Type I errors, we demonstrated that the p-hacking approach is not a statistically valid method, as it led to inflated Type I error rates and biased parameter estimates (Figure 8). The remaining six methods were all shown to be statistically valid, although partial r, lasso, and full lm showed slight inflation of Type I error (Figure 1). The methods can be further compared by their Type II error results. Overall, using no covariates performed the worst, as it led to the highest Type II error (and thus the lowest power). The no covariates method also showed more variability in the estimate of \(b_x\) (Figure 8, Figure 21, Figure 22). Including all covariates led to reductions in Type II error, and using one of the selection methods led to further reductions.
The Type II error of the selection methods can be examined more closely across research settings. For smaller sample sizes, we see the benefit of using a selection method over using all covariates (Figure 13, Figure 14). Lasso yielded slightly inflated Type I error, but it performs best for larger numbers of covariates (Figure 15, Figure 16) and for larger proportions of good covariates (Figure 17, Figure 18). This can be further seen when looking at the number of good covariates, where we again see lasso outperform the other methods. However, as the number of good covariates increases, the true positive rate of lasso for including these good covariates decreases relative to r and partial r (Figure 23). While the bivariate correlation method shows no inflation of Type I error, it does underestimate \(b_x\) for both \(b_x = 0.3\) (Table 14) and \(b_x = 0.5\) (Table 15). While the partial correlation method does show slight inflation of Type I error, it correctly estimates \(b_x\). For smaller correlations between \(Y\) and the covariates, we again see the benefit of selecting covariates rather than using all of them, especially for a larger \(X\) effect (Figure 20). Among the valid non-trivial selection methods, we see that full lm leads to the highest Type II error. When considering larger numbers of good covariates, we see the Type II error increasing for full lm, which, in some cases, performed even worse than including all covariates.
Overall, we can narrow the options for selection methods to bivariate correlation, partial correlation, full lm, and lasso. Any one of these may offer more power depending on the research context. While they each have advantages and disadvantages, researchers should also consider the complexity of these methods when weighing the options for selecting covariates.